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Abstract 

We develop and analyze A/-estimation methods for divergence functionals and the likelihood 
ratios of two probability distributions. Our method is based on a non-asymptotic variational 
characterization of /-divergences, which allows the problem of estimating divergences to be 
tackled via convex empirical risk optimization. The resulting estimators are simple to implement, 
requiring only the solution of standard convex programs. We present an analysis of consistency 
and convergence for these estimators. Given conditions only on the ratios of densities, we show 
that our estimators can achieve optimal minimax rates for the likelihood ratio and the divergence 
functionals in certain regimes. We derive an efficient optimization algorithm for computing our 
estimates, and illustrate their convergence behavior and practical viability by simulations!^ 

1 Introduction 

Divergences (or pseudo-distances) based on likelihood ratios between pairs of multivariate proba- 
bility distribution densities play a central role in information theory and statistics. For instance, 
in the asymptotic analysis of hypothesis testing, the Kullback-Leibler and Chernoff divergences 
control the decay rates of error probabilities (e.g., see Stein's lemma [7j and its variants). As a 
particular case of the Kullback-Leibler divergence, the mutual information specifies capacities in 



^Preliminary versions of this work were presented at the International Symposium on Information Theory (2007) 
[18j and the Neural Information Processing Systems Conference (2007) [17| . 
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channel coding coding and data compression [7]. In statistical machine learning and signal pro- 
cessing, divergences between probability distributions are frequently exploited as metrics to be 
optimized, such as in independent component analysis [6j and decentralized detection |25j. 

In all of these settings, an important problem is that of divergence estimation: how to estimate 
the divergence between two multivariate probability distributions, say P and Q, based on a set of 
samples from each distribution? A canonical example is estimation of the Kullback-Leibler (KL) 
divergence from samples. This problem includes as a special case the problem of estimating the 
mutual information, corresponding to the KL divergence between a joint distribution and the prod- 
uct of its marginals, as well as the problem of estimating the Shannon entropy of a distribution P, 
which is related to the KL divergence between P and the uniform distribution. Several researchers 
have studied the problem of Shannon entropy estimation [9l [131 [IH] based on various types of non- 
parametric techniques. Somewhat more generally, the problem of estimating an integral functional 
of a single density has been studied extensively, dating back to early work [121 [T6] from 1970s, and 
continuing on in later research [2l[3l[l5]. More recent recent work by Wang et al. [29] has developed 
algorithms for estimating the KL divergence between a pair of continuous distributions P and Q, 
based on building data-dependent partitions of equivalent (empirical) Q-measure. Wang et al. |30] 
also proposed an interesting nonparametric estimator of the KL divergence using 1-nearest neighbor 
technique. Both estimators were empirically shown to outperform direct plug-in methods, but no 
theoretical results on convergence rates were provided. 

In this paper, we propose methods for estimating divergence functionals as well as likelihood 
density ratios based on simple M-estimators. Although our primary interest is the Kullback-Leibler 
divergence, our methodology is more broadly applicable to the class of Ali-Silvey distances, also 
known as /-divergences [H |8] . Any divergence in this family, to be defined more formally in the 
sequel, is of the form ^'^(P, Q) = J (j){d'Q/clF)clF, where (p is a convex function of the likelihood 
ratio dQ/dF. 

Our estimation method is motivated by a non-asymptotic characterization of /-divergence, due 
independently to several authors [5l [T^ 119] . Roughly speaking, the main theorem in [1^ states 
that that there is a correspondence between the family of /-divergences and a family of losses such 
that the minimum risk is equal to the negative of the divergence. In other words, any negative 
/-divergence can serve as a lower bound for a risk minimization problem. This correspondence 
provides a variational characterization, by which the divergence D^{F,Q) can be expressed as the 
maximum of an Bayes decision problem involving two hypotheses P and Q. In this way, as we 
show in this paper, estimating the divergence D^{¥,Q) has an equivalent reformulation as solving 
a certain Bayes decision problem. This reformulation leads to an M-estimation procedure, in which 
the divergence is estimated by solving the convex program defined by the Bayes decision problem. 
This approach not only leads to an M-estimation procedure for the divergence but also for the 
likehhood ratio dF/dQ. 

Our second contribution is to analyze the convergence and consistency properties of our estima- 
tors, under certain assumptions on the permitted class Q of density ratios, or logarithms of density 
ratios. The analysis makes use of some known results in empirical process theory for nonparametric 
density estimation [261 128] . The key technical condition is the continuity of the suprema of two 
empirical processes, induced by P and Q distributions respectively, with respect to a metric defined 
on the class Q of permitted functions. This metric arises as a surrogate lower bound of a Bregman 
divergence defined on a pair of density ratios. If ^ is a smooth function class with smooth param- 
eter Q > d/2, it can be shown that our estimates of the likelihood ratio and the KL divergence are 
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both optimal in the minimax sense with the rate n °/('^+2«) ^nd n 



respectively. 



Our third contribution is to provide an efficient implementation of one version of our estimator, 
in which the function class Q is approximated by a reproducing kernel Hilbert space (RKHS) [21j . 
After computing the convex dual, the estimator can be implemented by solving a simple con- 
vex program involving only the Gram matrix defined by the kernel associated with the RKHS. 
Our method thus inherits the simplicity of other kernel-based methods used in statistical machine 
learning [22^ I23j. We illustrate the empirical behavior of our estimator on various instances of KL 
divergence estimation. 

The remainder of this paper is organized as follows. In Section [21 we provide the variational 
characterization of /-divergences in general, and KL divergence in particular. We then develop an 
M-estimator for the KL divergence and the likelihood ratio. Section [3] is devoted to the analysis of 
consistency and convergence rates of these estimators. In Section U we develop an efficient kernel- 
based method for computing our M-estimates, and provide simulation results demonstrating their 
performance. In Section [5l we discuss our estimation method and its analysis in a more general 
light, encompassing a broader class of /-divergences. We conclude in Section [H 

Notation: For convenience of the reader, we summarize some notation to be used throughout the 
paper. Given a probability distribution P and random variable / measureable with respect to P, we 
use J" /dP to denote the expectation of / under P. When P is absolutely continuous with respect to 
Lesbesgue measure, say with density po, this integral is equivalent to the usual Lebesgue integral 
/ fpodfj, = f f {x)po{x) fj,{dx) . Given samples X^-^\ . . . , X^^^ from P, the empirical distribution P„ is 
given hy Fn = ^ Yl^=i corresponding to a sum of delta functions centered at the data points. 

We use J fdFn as a convenient short-hand for the empirical expectation ^ X]"=i f{X^^^). 

2 M-estimators for KL divergence and the density ratio 

We begin by defining /-divergences, and describing a variational characterization in terms of a Bayes 
decision problem. We then exploit this variational characterization to develop an M-estimator. 

2.1 Variational characterization of /-divergence 

Consider two probability distributions P and Q, with P absolutely continuous with respect to 
Q. Assume moreover that both distributions are absolutely continuous with respect to Lebesgue 
measure /i, with densities po and qo, respectively, on some compact domain X C M^. The Kullback- 
Leibler (KL) divergence between P and Q is defined by the integral 



J qo 

This divergence is a special case of a broader class of divergences known as Ali-Silvey distances or 
/-divergence fBl, 1] , which take the form 



where (/> : R ^ R is a convex and lower semi-continuous (l.s.c.) function. Different choices of (f) 
result in many divergences that play important roles in information theory and statistics, including 




(1) 




(2) 
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not only the KL divergence ([T]) but also the total variational distance, the Hellinger distance, and 
so on; see [23] for further examples. 

We begin by stating and proving a variational representation for the divergence D,^. In order 
to do so, we require some basic definitions from convex analysis [20l[TT]. The subdifferential d4>{t) 
of the convex function at a point t G M is the set 

d(l){t) := {z £ M \ (pis) > (l){t) + z {s - t) VsGR}. (3) 

As a special case, if (p is differentiable at t, then d(f){t) = {(f)'{t)}. The function (p* is the conjugate 
dual function associated with cp, defined as 

(p*{v) := sup {u V — (p{u)} . (4) 

With these definitions, we have: 

Lemma 1. For any class of functions T mapping from X to M, we have the lower hound 

D^{¥, Q) > sup f[fdq- (P*{f) dF] . (5) 

Equality holds if and only if the subdifferential 50(go/po) contains an element of J-'. 



Proof. Since (p is convex and lower semi-continuous, Fenchel convex duality [20] guarantees that we 
can write cp in terms of its conjugate dual as cp{u) = sup^gjg {uv — (p*{v)^. Consequently, we have 

D4F,q) = I PosMfqo/po-(p*{f))dfi 



sup j [fqo - (p*if)Po] dfi 
sup j [f dQ-(P*{f)dF], 



f 

where the supremum is taken over all measurable functions / : — > M. It is simple to see that 
equality in the supremum is attained at a function / such that qo/po £ d(p*{f) where qo,po and / 
are evaluated at any x £ X. By convex duality, this is true if / G d(p{qo/po) for any x £ X. □ 



2.2 M-estimators for the KL divergence and likelihood ratio 

We now describe how the variational representation ([5]) specializes to an M-estimator for the 
Kullback-Leibler (KL) divergence. As a particular /-divergence, the KL divergence is induced by 
the convex function 

^(,) . I -MS) fort>0 
^ ^ \+oo for t < 0. 

A short calculation shows that the conjugate dual takes the form 

(P*{v) = l if n < 0, and 

1 +00 otherwise. 
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As a consequence of Lemma [H we obtain the following representation of the Kullback-Leibler 
divergence: Dk{^,Q) = supj<o / f dQ - /[-I - log{- f)]dF. After the substitution g = -/, this 
can be written as 

DKiF,q) = sup [ log gdF- [ gdQ + l, (8) 

g>0 J J 

for which the supremum is attained at go = Po/qo- 

We now take a statistical perspective on the variational problem ([8]), where we assume that the 
distributions P and Q are unknown. We suppose that we are given independent and identically dis- 
tributed (i.i.d.) samples, say X^-^^ , X^^) , . . . , X^") drawn i.i.d. from P, and 1"*^-^^ , Y^'^'^ , . . . , F^") drawn 
i.i.d. from Q. Denote by the empirical distribution defined by the samples {X^^^ ; • • • , X^"'^}, given 
explicitly by P„ := ^ X]"=i '^^(i) > with the empirical distribution Q„ associated with {Y^^^ , • • • , y^"^} 
defined analogously. We consider two classes of estimators: 

Estimator El: Given the empirical distributions, we consider the estimator obtained by replacing 
the true distributions P and Q with their empirical versions, and maximizing over some pre-specified 
class Q of functions g : X ^ M+, as follows: 

gn = argmaxggg j log 5 (iP„ - j 9 dQn, (9a) 

Dk = j log?n dFn - jgn dQn + 1. (9b) 

Assuming that ^ is a convex set of functions, the implementation of the estimator requires solving 
a convex optimization problem, albeit over an (infinite-dimensional) function space. For this esti- 
mation method to work, several conditions on Q are required. First, so as to control approximation 
error, it is natural to require that Q is sufficiently rich so as to contain the true likelihood ratio go 
in the sense of KL divergence, i.e., there is some g (z G such that g = go a.e.. On the other hand, 
Q should not be too large, so that estimation is possible. To formalize this condition, let I{g) be 
a measure of complexity for g, where I is a non-negative functional and I{go) < 00. Given some 
fixed finite constant M* > I {go), we then define 

g := g^,, := {g : I{g) < M*}. (10) 



Estimator E2: In practice, the "true" I{go) is not known, and hence it is not clear as a practical 
matter how to choose the fixed M* defining estimator El. Thus we also consider an approach that 
involves an explicit penalty 1(g)- In this approach, let 

G = Ui<M<oo^M- (11) 
The estimation procedure involves solving the following program: 

dn = argmin^gg y 5dQ„ - y logg dP„ + ^-^^(5), (12a) 
Dk = [ log gn dFn - [dndQn + l, (12b) 
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where A^, > is a regularization parameter. 

As we discuss in Section HI for function classes defined by reproducing kernel Hilbert spaces, 
problems (I9b|l and (112aP can actually be posed as a finite-dimensional convex programs (in n 
dimensions), and solved efficiently by standard methods. In addition to the estimate Dk of the 
KL divergence, if the supremum is attained at 'gn, then gn is an M-estimator of the density ratio 
50 ■=Po/qo- 

In the next section, we present results regarding the consistency and convergence rates of both 
estimators. While these methods have similar convergence behavior, estimator El is somewhat 
simpler to analyze and admits weaker conditions for consistency. On the other hands, estimator E2 
seems more practical. Details of algorithmic derivations for estimator E2 are described in Section [H 

3 Consistency and convergence rate analysis 

For the KL divergence functional, the difference \Dk—Dk{^, Q)| is a natural performance measure. 
For estimating the density ratio function, this difference can also be used, although more direct 
metrics are customarily preferred. In our analysis, we view qq = Po/qo as a density function with 
respect to Q measure, and adopt the (generalized) Hellinger distance as a performance measure for 
estimating the likelihood ratio function: 

hligo^g) ■=\j{^o- Vaf dQ. (13) 
3.1 Consistency of estimator El 

Our analysis of consistency relies on tools from empirical process theory. Let us briefly review the 
notion of the metric entropy of function classes (see, e.g., [28] for further background). For any 
r > 1 and distribution function Q, define the empirical metric 

llffllLw) •= / lal^'dQ, 

and let Lr{Q) denote the metric space defined by this distance. For any fixed 5 > 0, a covering for 
function class G using the metric Lr(Q) is a collection of functions which allow Q to be covered using 
Lr(Q) balls of radius 5 centered at these functions. Letting Ns{G, -^r(Q)) be the smallest cardinality 
of such a covering, then TCs{0, Lr{Q)) := log Ns{Q, Lr{Q)) is called the entropy for G using the Lr{Q) 
metric. A related notion is entropy with bracketing. Let {Q, Lr{Q)) be the smallest value of 
for which there exist pairs of functions {gjjgf} such that Wgj' — gj'WiriQ) ^ and such that for 
each g £ g there is a j such that gf < g < gf. Then nf{Q,Lr{Q)) := log A^f (g, L^(Q)) is called 
the entropy with bracketing of Q. Define the envelope functions: 

g{x^ 

Go{x) = sup\g{x)\, and Gi(a;) = sup | log — — — -j. (14) 
g&g geg 9o{x) 

For the estimator El, we impose the following assumptions on the distributions P, Q and the 
function class Q. 
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Assumptions. 



(A) . The KL divergence is bounded: D/<(P, Q) < oo. 

(B) . There is some g G G such that g = go almost surely (a.s.). 

In the following theorem, the almost sure statement can be taken with respect to either P or 
since they share the same support. 

Theorem 1. Suppose that assumptions (A) and (B) hold, 
(a) Assume the envelope conditions 



I 



GodQ < oo (15a) 

GidF < oo (15b) 

and suppose that for all 6 > there holds: 

-ns{G-go,Li{qn)) ^0, (16a) 
n 

^ns{logg/go,Li(Fn)) ^ 0. (16b) 

Then, \Dk - I)i^(P,Q)| ^ 0, and /iQ(5o,?n) ^ 0. 
(b) Suppose only that (|15ap and (|16a|) hold, and 

-?^5(log^T^,Ll(P„))^0. (17) 



n 2go 

Then hQ{go,gn) ^ 0. 



To provide intuition for the conditions of Theorem [H note that the envelope condition (jl5ap is 
relatively mild, satisfied (for instance) if G is uniformly bounded from above. On the other hand, 
the envelope condition ()15bp is much more stringent. Due to the logarithm, this can be satisfied 
if all functions in G are bounded from both above and below. However, as we see in part (b), 
we do not require boundedness from below; to ensure Hellinger consistency we can drop both the 
envelope condition ()15bp and the entropy condition (|16bp . replacing them with the milder entropy 
condition (fT7|) . 

It is worth noting that both (jl6ap and (jl7p can be deduced from the following single condition: 
for all 5 > 0, the bracketing entropy satisfies 

Hf(g,Li(Q)) <oo. (18) 

Indeed, given equation ()15ap and by the law of large numbers, condition (fT8]l directly implies ()16ap . 
To establish condition (jl7p . note that by a Taylor expansion, we have 



, gi + go , 92 + go 

log — log 



2^0 2^0 



^ |gi -g2| 
90 
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i)) < ^ns{g/go,Li{Fn)). Since Go G Li(Q), we have Go/50 G 



so that i7^5(log^,Li 
In addition, 7{f{Q /go, Li(¥)) < 7{f{Q, Li{Q)) < 00. By the law of large numbers, the metric 
entropy Ti.s{G / go, Li{Fn)) is bounded in probability, so that condition (fT7|) holds. 

In practice, the entropy conditions are satisfied by a variety of function classes. Examples 
include various types of reproducing kernel Hilbert spaces j21]. as described in more detail in 
Section m as well as the Sobolev classes, which we describe in the following example. 

Example 1 (Sobolev classes). Let k = (ki,... ,K[i) be a d-dimensional multi-index, where all Kj 



are natural numbers. Given a vector x S 



define x'^ 



n- 



and \k\ 



suitably differentiable function /, let denote the multivariate differential operator 



D-fix) 



dxl' ...dx 



For a 



(19) 



and define the norm 



(X) 



Ei.i=j\DViocrdx. 



With this notation, we define the norm 



W^{X) ■= \\f\\L2{X) + \\f\\L^{X), (20) 

and the Sobolev space W^^i^) of functions with finite ||/||i4/2''(A')-iiorm. Suppose that the do- 
main X is a compact subset of W^. Let the complexity measure I be the Sobolev norm — that is, 
I{g) := \\g\\wf{x)- With this choice of complexity measure, it is known [2] that the function class 
Q defined in equation satisfies, for any 6 > 0, the metric entropy bound 



O 



< 00, 



(21) 



for all smoothness indices a > d/2. As a result, both conditions (jlSp and ()16bp hold if, for instance, 
Q is restricted to a subset of smooth functions that are bounded from above, and go is bounded 
from below. 



3.2 Proof of Theorem [T] 

We now turn to the proof of Theorem [H beginning with part (a). Define the following quantities 
associated with the function class Q: 



Dk\ 



sup 

966 



iij - sup 
log5(i(P„ 



{\oggdF-g 



+ 1) > 



(22) 



(23) 



The quantity £0 is the approximation error, which measures the bias incurred by limiting the 
optimization to the class Q. The term E-i is the estimation error associated with the class Q. Our 
focus in this paper is the statistical problem associated with the estimation error £1, and thus we 
have imposed assumption (B), which implies that the approximation error £q{Q) = 0. Moreover, 
from equations ([8]) and ()9bp . straightforward algebra yields that 



|Z?x-Di^(P,Q)| < £o{Q)+£i{Q)=£i{g). 



(24) 
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Accordingly, the remainder of the proof is devoted to analysis of the estimation error £i{G). 
In order to analyze £i{G), define the following processes: 



VniG) ■■= sup 

965 



Wn{go) ■-- 
Note that we have 



[ log ^d(Fn - P) - [{g- go)d{Qn 
J go J 

j log 90 d(Fn - P) - 50C?(Qn " Q) • 



and (25a) 
(25b) 



£l{Q)<Vn{Q)+Wn{go). (26) 

Note that the quantity Wn{go) is the difference between an empirical and a population expec- 
tation. Let us verify that the conditions for the strong law of large numbers (SLN) apply. Using 
the inequality 

j Po\ iog(po/go)l < i?A'(iP,Q) + WDk{^,Q), 

due to Csiszar (cf. [9]), it follows that log^o is P integrable. In addition, the function qq is Q inte- 
grable, since J godQ = f {po/qo)dQ = 1. Thus, the SLN applies, and we conclude that Wn{go) 0. 
By applying Theorem [5] from Appendix[Gl we conclude that Vn{G) 0. Prom the decomposition 
in equation 1^, we conclude that £i{g) ^ 0, so that \Dk - Dk(F,Q)| ^ 0. 

To establish Hellinger consistency of the likelihood ratio, we require the following lemma, whose 
proof is in Appendix |Aj 

Lemma 2. Defining the "distance" d{gQ,g) := J{g — go)dQ — log '^^dF, the following statements 
hold: 

(i) For any g ^Q, we have d{go,g) > 2h'^{g,go). 

(a) For the estimate gn defined in equation ()9ap . we have d{gQ,gn) < Vn{G)- 

The Hellinger consistency hQ{go,gn) of Theorem [TJa) is an immediate consequence of this 
lemma. 

Turning now to Theorem [T] (b), we require a more refined lemma relating the Hellinger distance 
to suprema of empirical processes. 

Lemma 3. Ifgn is an estimate of g, then: 

hl{go,gn) < 2/^^(50, '-^) <-j ^^d(Qn - + / log ^I^^IP" - n 

See Appendix [B] for the proof of this claim. To complete the proof of Theorem [H define 
G2{x) = supggg I log ^^^2g'olx) I ' ^o Lemma [3] and standard results from empirical process 

theory (see Appendix [Gl Theorem [5]) it is sufficient to prove that J G2dF < oo. To establish this 
claim, note that 

^4^-1, log2|dP 



1 G2d¥ < 


/ sup max < 




J geg I 



< log 2 + / sup \g{x 
J g&g 

where the last inequality (a) is due to envelope condition (jl5ap . 



(a) 

9o{x)\dQ, < oo, 

see 



9 



3.3 Convergence rates 



In this section, we describe convergence rates for both estimator El and estimator E2. The key 
technical tool that we use to analyze the convergence rate for the likelihood ratio estimate is 
Lemma [3l used previously in the proof of Theorem [TJ This lemma bounds the Hellinger distance 
/1(q((7o,5„) in terms of the suprema of two empirical processes writh respect to P and Q. In a 
nutshell, the suprema of these two empirical processes can be bounded from above in terms of the 
Hellinger distance, which allows us to obtain the rates at which the Hellinger distance goes to zero. 

3.3.1 Convergence rates for estimator El 

In order to characterize convergence rates for the estimator El, we require one of the following two 
conditions: 

supll^lloo < K2 (27a) 
< i^i < inf5r(x), and supg{x)<K2 for ah g^Q. (27b) 

^ X 

We also require the following assumption on function class G := {{{g + g'o)/2)^/^, (7 € G}'- for some 
constant < 7g < 2, there holds for any 5 > 0, 

nfiG,L2iq)) = Oi6-^s). (28) 

Combining this metric entropy decay rate with condition (j27ap . we deduce that for G, the bracketing 
entropy satisfies 

nfiG^L^m) = 0{d-^s). (29) 

With these definitions, we can now state a result characterizing the convergence rate of estimator 
El, where the notation Op means "bounded in probability" with respect to P measure. 

Theorem 2 (Convergence rates for estimator El), (a) If conditions (j27ap and (j28p hold, then 

/iQ(ffO,?n) = 0p(n-l/(^e+2)). 

(b) If conditions (f27bll and ([28]) hold, then \Dk --^^(P.Q)! = Op(n-i/2). 



Remarks: In order to gain intuition for the convergence rate in part (a), it can be instructive to 
compare to the minimax rate 

r„ := Jnf supEp [^(^o, ?n)], 

where the supremum is taken over all pairs (P, Q) such that go ^ G- As a concrete example, if we 
take G as the Sobolev family from Example [H and if condition (]27bj) holds, then the minimax rate 
is r„ = r2(n~^/('^"'"^)), where 7 = 7g = d/a (see Appendix [Ej) . Thus, we see that for the Sobolev 
classes, the estimator El achieves the minimax rate for estimating the likelihood ratio in Hellinger 
distance. In addition, the rate n~^^'^ for the divergence estimate is also optimal. 
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3.3.2 Convergence rates for estimator E2 

We now turn to a discussion of the convergence rates of estimator E2. To analyze this estimator, 
we assume that 

I (go) < oo, (30) 

and moreover we assume that the true likelihood ratio go — but not necessarily all of Q — is bounded 
from above and below: 

< T/o < 5o < ^1 for some constants r/O; ^i- (31) 

We also assume that the sup-norm over Qm is Lipschitz with respect to the penalty measure I{g), 
meaning that there is a constant c < cxd such that for each M > 1, we have 

sup ll^lloo < cM. (32) 

Finally, we assume that the bracketing entropy of Q satisfies, for some < 7 < 2, the condition 

nf{gM,L2{Q)) = 0[iM/5y] for any 6 > 0. (33) 
Given these assumptions, we then state the following convergence rate result for the estimator E2: 

Theorem 3. (a) Suppose that assumptions ([30]) through ([33]) hold, and that the regularization 
parameter A„ ^ is chosen such that 

= Op (n2/{2+7)) . 

Then under P, we have 

hQ{go,gn) = Op{Xl/^). (34) 
(b) Suppose that in addition to assumptions ()30p through ()33p . there holds inf mi g{x) > r]Q. Then 
we have 

\DK-DK{^,Q)\ = Ow{n-'^^). (35) 

Remarks: Note that with the choice = Op (n^/^^^'''^) and the special case of Q as the Sobolev 
space Ty^(A') (see Example [T]), estimator E2 again achieves the minimax rate for estimating the 
density ratio in Hellinger distance. 

3.4 Proof of convergence theorems 

In this section we present a proof of Theorem [3l The proof of Theorem [2] is similar in spirit, and 
is provided in Appendix [Dj The key to our analysis of the convergence rate of estimator E2 is the 
following lemma, which can be viewed as the counterpart of Lemma E) 

Lemma 4. Ifgn is an estimate of g using ()12ap . then: 

\hl{go,gn) + Y^^(9n)<-J{gn-9o)d{Qn-Q) + 1 21og^^^^d(P„-P) + ^/2(5o). 
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See AppendixOfor the proof of this lemma. Equipped with this auxihary result, we can now prove 
Theorem [3l^a) . Define fg := log^^^, and let '■= {fg \ 9 £ Gm}- Since fg is a Lipschitz 
function of g, conditions (j3ip and (j33p imply that 



nf{J'M,L2{F)) = 0{{M/5y}. (36) 

Applying Lemma 5.14 from van de Geer [26] using the distance d2{go,g) = 115 — S'o||l2(Q)' have 
that the following statement holds under Q, and hence holds under P as well, since cZP/dQ is 
bounded from above: 



sup ■ — ■ 2 = C'p(l). (37) 

n-i/2d2(<^o, 5)'"^/'(l + m + /(5o))^/2 V n"^ (1 + l{g) + /(^o)) 

In the same vein, we obtain that under P, 

l/MPn-P)| 

sup ■ ■ 2 Up(l). (38) 

see n-i/2d2(5o, 5)'"^/'(l + /(s) + /(^o))^/^ v n"^ (1 + 1(5) + I(<7o)) 

Now using condition (|32|) . it can be verified that 

^^2(50, 5) = Wa - 5o||l2(q) < 2c^l'^{\ + l[g) + /(5o))^/^/iQ(5o, <?)• 
Combining Lemma [H and equations ([38j) and (j37|) , we conclude that 

^/iQ(50,?«) + Y/'(5n) < A„/(<7o)V2 



From this point, the proof involves simple algebraic manipulation of (|39|) . To simplify notation, let 
h = hQ{gQ,gn), I = I(gn)i and Iq = I{go). We break the analysis into four cases, depending on the 
behavior of h and /: 

Case A:. In this case, we assume h > n^^/(^+''')(l + / + /q)^^^ and I > 1 + Iq- From ([39]) . we have 
either 

P/4 + XnP/2 < Op(n-l/2)/Jl-7/2jl/2+7/4 J^2/^ ^ < ^^j2/^_ 

These conditions imply, respectively, either 

h < A;i/20p(n-2/(2+7)), J < A;;iOp(n-2/(2+7)), 



or 



h<0^{Xi/^Io), /<0p(/o)- 
In either case, we conclude the proof by setting A^^-*^ = C'p(n^/'^'''+^) (1 + Iq)). 

Case B. In this case, we assume that h > n^^/(^+^') (I+I+Iq)^/^ and / < I+Iq. From equation ([39]) . 
we have either 

h^i + A„P/2 < Op(n-i/2)/ii-^/2(l + /o)V2+7/4^ or h^i + A„?/2 < A„/oV2. 
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These conditions imply, respectively, that 
or 

h<Op{Xl/^Io), and / < Op(/o). 
In either case, the proof is concluded by setting A^-*^ = Op(n^/('^+^) (1 + Iq)). 

Case C. In this case, we assume that /i < n"^/(2+^)(l+/+/o)^/^ and/ > I+Jq. From equation (1391) . 
we have 

/iV4 + A„?/2<Op(n"2/(2+7))j, 

which implies that h < Op(n~^/(2+7))/i/2 ^nd / < A^^0p(n"2/(^+'^)). Consequently, by setting 
A-i = Op(n2/(2+7))(i + Jo), we obtain 

/i<Op(Ay2)(l + /o), and /<Op(l + /o). 

Case D. In this final case, we assume that h < n^^/^'^^"'\l + / + Iq)^^"^ and / < 1 + /q, and the 
claim of Theorem [3|^a) follows. 



We now proceed to the proof of Theorem [3l^b) . Note that part (a) and equation p2p imply that 
||?n||oo = Op{l + /(gfo)). Without loss of generality, assume that < rjQ < go{x) and gn{x) < for 
all X E A". Then we have 

Dk - Di^(P,Q) = j log?„cflP„ - j gndQn - ( j loggodF - j godQ) 

< j log?„d(P„ - IP) - / dndiQn - Q) =: Ti 

Also, 

Dk - Dk{F, Q) > j ^oggodiFn -^)- j 5od(Qn - Q) =: T2 

We have T2 = Of{n~^^^) by the central limit theorem. To bound Ti, we apply a modulus of 
continuity result on the suprema of empirical processes with respect to function g and log g, where 
g is restricted to smooth functions bounded from below (by r]o) and above (by r/i). As a result, 
the bracketing entropy for both function classes Q and log^ has the same order 0(1/5)''' as given 
in (jSSp . Apply Lemma 5.13 of [26j (page 79), to obtain that for 6n = n~^^^'^^"'\ there holds: 

ri = Op(n-V2||^„_^„||l-7/2v52) ^ 0,(,-2/(2+,))^ 

thanks to part (a) of the theorem. For 7 < 2, —2/(2 + 7) < —1/2. So the overall rate is n^^^"^. 



4 Algorithmic implementation and simulation examples 

In this section, we turn to the more practical issue of implementation, focusing in particular on 
estimator E2. When Q has a kernel-based structure, we show how, via conversion to its dual form, 
the computation of the estimator E2 reduces to the solution of an n-dimensional convex program. 
We illustrate the performance of this practical estimator with a variety of simulations. 
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4.1 Algorithms for kernel-based function classes 

We develop two versions of estimator E2: in the first, we assume that ^ is a reproducing kernel 
Hilbert space (RKHS), whereas in the second, we assume that log^ forms an RKHS. In both cases, 
we focus on the Hilbert space induced by a Gaussian kernel. This choice is appropriate as it is 
sufficiently rich, but also amenable to efficient optimization procedures [22j. 

We begin with some background on reproducing kernel Hilbert spaces; see the books |2H [22] 
for further details. Consider a positive definite function K mapping the Cartesian product X x X 
to the non-negative reals. By Mercer's theorem, any such kernel function K can be associated 
with a feature map ^ : X ^ 7i, where W is a Hilbert space with inner product (•, •). Moreover, 
for all X, x' G X, the inner product in this Hilbert space is linked to the kernel via the relation 
K{x,x') = {^(x), ^{x')). As a reproducing kernel Hilbert space, any function g €z TC can be 
expressed as an inner product g{x) = {w, ^{x)), where WqW-h = WwW-}^. The kernel used in our 
simulations is the Gaussian kernel: 

K{x,y) := exp{-||x-y||Vf7}, (40) 

where || • || is the Euclidean metric in M'^, and cr > is a parameter for the function class. 



4.1.1 Imposing RKHS structure of G 

Suppose that the function class Q in estimator E2 is the Gaussian RKHS space Ti., and let the 
complexity measure be the Hilbert space norm 1(g) = WgWn- With these choices, equation (112aP 
becomes: 

1 " 1 A 

/* = min J :=min- «>(xi)) - - Vlog(u;, «>(yj)) + ^llu-llli, (41) 

1=1 3=1 

where the samples {xi\ and {yj} are i.i.d. from Q and P, respectively. The log function is extended 
to take the value — cxd for negative arguments. 

Lemma 5. The primal value f* has the equivalent dual form: 

1 " 1 1 1 1 

-1 Vlognoj + — Vaiaji^(yi,yj) + — — 2 y]^(^i'^i) " ^; — y2ajK{xi,yj) } 



-mm < 

a>0 



Moreover, the optimal dual solution a is linked to the optimal primal solution w via the relation 

^ n 1 

^ = 3-(E"^-^(y^)--E'^(^^))- (42) 

j=l 1=1 

Proof. Let il>i{w) := ^{w, $(xj)), (pj{w) := —^log{w, ^(yj)), and ^l{w) = ^WwWf^. We have 
/* = - max((0, w) - J{w)) = - J*(0) 



w 

n 



min V ipi (wi) + V [vj ) + (- V - V ) , 

1=1 j=l i=l j=l 
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where the last hne is due to the inf-convolution theorem [20]. Simple calculations yield: 



- i - i log naj if w = -aj $ ) 
+00 otherwise 



tpi{u) = if u = —^(xi) and + 00 otherwise 



1 

2 An 



Thus, we conclude that /* = -min„^ Ej=i(-^-^ log™aj) + 2i;rll Si=i Er=i 
from which the claim follows. The primal-dual relation (j42|) also follows from these calculations. □ 

For an RKHS based on a Gaussian kernel, the entropy condition (I33p holds for any 7 > (cf. 
Zhou [33J). Furthermore, condition (j32p holds since, via the Cauchy-Schwarz inequality, we have 



\gix)\ = \{w, «>(x))| < \\w\\nmx)\\H < I{g)yjK{x,x) < I{g). 

Thus, by Theorem[3ja), we have WwW-h = ||gn||w = C'p(lbollw)) so the penalty term An||w|P vanishes 
at the same rate as An. Thus, we obtain the following estimator for the KL divergence: 

"11 "1 
Dk = 1 + - -log'^^i) = (43) 

4.1.2 Imposing RKHS structure on log^ 

An alternative starting point is to posit that the function class log^ has an RKHS structure. In 
this case, we consider functions of the form g{x) = exp(t(;, <I>(x)), and use the complexity measure 
I{g) = II log(7||-H = ll^llw- Unfortunately, Theorem [3] does not apply directly because condition ([32]) 
no longer holds, but this choice nonetheless seems reasonable and worth investigating from an 
empirical viewpoint. 

A derivation similar to the previous case yields the following convex program: 



n n . 

min J := min ^ V e^'"' ^(u;, $(y,)) + — Ikllw 



ui w n — ' n 

i=\ j=i 



1=1 1=1 ]=1 



Letting a be the solution of the above convex program, the KL divergence can be estimated by: 

n 

£>x = 1 + y^Sjlogai + Silog-. (44) 



i=l 



4.2 Simulation results 

In this section, we describe the results of various simulations that demonstrate the practical viability 
of the estimators (j43p and ()44p . as well as their convergence behavior. We experimented with our 



15 



estimators using various choices of P and Q, including Gaussian, beta, mixture of Gaussians, and 
multivariate Gaussian distributions. Here we report results in terms of KL estimation error. For 
each of the eight estimation problems described here, we experiment with increasing sample sizes 
(the sample size, n, ranges from 100 to 10^ or more). Error bars are obtained by replicating each 
set-up 250 times. 

For all simulations, we report our estimator's performance using the simple fixed rate A„ ~ 1/n, 
noting that this may be a suboptimal rate. We set the kernel width to be relatively small {a = .1) 
for one-dimensional data, and choose larger a for higher dimensions. We use Ml and M2 to denote 
the estimators (j43p and (j44p . respectively. We compare these methods to algorithm A in Wang et 
al [29], which was shown empirically to be one of the best methods in the literature. This method, 
to be denoted by WKV, is based on data-dependent partitioning of the covariate space. Naturally, 
the performance of WKV is critically dependent on the amount s of data allocated to each partition; 
here we report results with s ~ n'^ , where 7 = 1/3, 1/2, 2/3. 

The four plots in Figure 14.21 present results with univariate distributions. We see that the 
estimator M2 generally exhibits the best convergence rate among the estimators considered. The 
WKV estimator performs somewhat less well, and shows sensitivity to the choice of partition size s, 
with the ranking of the different WKV estimators changing over the experiments. The performance 
of estimator Ml is comparable to that of the WKV estimator, although clearly better in the first 
plot. In Figure [4?2] we present the results with two- and three-dimensional data. Again, estimator 
M2 has the best convergence rates in all examples. The Ml estimator does not converge in the 
last example, suggesting that the underlying function class exhibits very strong bias. In these 
examples, the WKV estimator again shows sensitivity to the choice of partition size; moreover, its 
performance is noticeably degraded in the case of three-dimensional data (the lower two plots). 

It is worth noting that as one increases the number of dimensions, histogram-based methods 
such as WKV become increasingly difficult to implement, whereas increased dimension has only a 
mild effect on the complexity of implementation of our method. 



5 Some extensions 

In this section, we discuss some extensions and related estimators, all based on the same basic 
variational principle. 



5.1 Estimation of likelihood ratio functions 

Suppose that we are primarily interested in estimating the likelihood ratio function po/qo, as op- 
posed to the Kullback-Leibler divergence. In this case, we may consider any divergence functional 
D(j,(pQ,qQ), where (p is a convex function on M+, possibly different than the logarithm leading to 
KL divergence. Again applying Lemma [U choosing a different divergence leads to the following 
alternative estimator of the likelihood ratio: 



fn := argmax^g^y /(iQ„-y (/>*(/) dE„ (45) 
D4, := [ fndQn- [ (p*{fn)dFn. (46) 



The quantity /„ is an estimate of the quantity /o = d(p{qo/po), whereas Dff^ is an estimate of the 
divergence D^{po,qQ) (of secondary interest for the moment). 
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Figure 1. Results of estimating KL divergences for various choices of probability distributions. 
In all plots, the X-axis is the number of data points plotted on a log scale, and the Y-axis is the 
estimated value. The error bar is obtained by replicating the simulation 250 times. Nt{a, Ik) denotes 
a truncated normal distribution of k dimensions with mean (a, . . . , a) and identity covariance matrix. 



17 



1.5r 



Estimate of KL(Nj{0,y ,Unif[-3,3]T 




0.777712 
M1,o = .5, >. = .1/n 
M2, a = .5, >. = .1/n 



,1/3 



-x- 'WKV, n' 
' WKV, n^'^ 



500 



5000 



Estimate of KL(Nj(0,l3),Unif[-3,3]': 




Tob 2otr 



"^otr 



1.16657 

— Ml o = 1,?.= .1/n" 
^^M2,o = 1,?.= .1/n 
— e-M2,o = 1,?.= .1/n^' 
-"'WKV.n'" 

' WKV, 1^'^ 

lote fflte ate K 



1.5 



0.5 



Estimate Of KL(N|{0,l2),N,(1,l2)) 



0.959316 

-e-M1,o = .5, ;i=.1/n 
^^M2, o = .5, ).= .1/n 
-X- WKV,n"^ 
' WKV,n"^ 



500 



1000 2000 



5000 



Estimate of KL(N|(0,l3),N|(1,!g)) 



1.8 
1.6 
1.4 
1.2 
1 

0.8 
0.6 
0.4 
0.2 



-0.2 



4 — }- 




♦- - ♦ 



— 1 .43897 
^- M1,o=1,>. = .1/n 
^^M2,o=1,>. = .1/n 
-0- ' WKV, n"^ 
-«-'WKV,n"^ 

2ob 5ob lote ate 5(te Tote 



Figure 2. Results of estimating KL divergences for various choices of probability distributions. 
In all plots, the X-axis is the number of data points plotted on a log scale, and the Y-axis is the 
estimated value. The error bar is obtained by replicating the simulation 250 times. Nt{a, Ik) denotes 
a truncated normal distribution of k dimensions with mean (a, . . . , a) and identity covariance matrix. 
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We make the following observations: 

• If (/> is a differentiable and strictly convex function, i.e., > 0, then the likelihood ratio 
function Po/qo can be recovered by applying the inverse of (f)' to fn- Thus, we obtain a family 
of estimation methods for the likelihood ratio function by simply ranging over choices of 4>. 

• If (on the other hand) the function (j) is chosen to be non-differentiable, we cannot directly 
invert the mapping d(p, but we can nonetheless obtain estimators for other interesting objects. 
For instance, suppose that (p has the piecewise-linear form 




\u - 1| if > 
+00 otherwise. 



so that D(j) is the variational distance. Noting that d(j){u) = sign(n — 1) for any n > 0, we 
see that the quantity /„ in equation (j45p provides an estimate of the thresholded likelihood 
ratioH 

5.2 Extensions to different 

Let us assume that is chosen to be differentiable and strictly convex, so that we can estimate the 
likelihood ratio qq by applying {4>')~^ to /„. Since there are many such (/>, it is natural to ask how 
the properties of affect the quality of the estimate of qq. The analysis provided in the preceding 
sections can be adapted to other choices of (p, as we describe here. 

In order to describe these extensions, we first define a distance between /o and /: 

d^ikJ) ■■= D4F,Q)- J fdQ-cl)*if)dF (47) 

Note that this distance is simply the generalization of the quantity d{go,g) previously defined in 
Lem[2]). For future reference, we note the equivalence 

d^ifoJ) = l{r{f)-r{fo))dF-{f-fo)dQ 

= I (/) - fifo) - ^(/o)(/ - /o)) dF, 

where the final line uses the facts that ^^(/o) = Qo/Po and (p'{qo/po) = /o- This expression shows 
that d(f, is the Bregman divergence defined by the convex function (jf . 

^ In fact, there is strong relationship between variational distance and a threshold function of the likelihood ratio. 
Note that the conjugate dual for has the form: 

{-1 if « > -1 
viive [-1,1] 
+CXD otherwise, 

which is related to a hinge loss in the literature of binary classification in machine learning. Indeed, a binary 
classification problem can be viewed as estimating the threshold function of the likelihood ratio. See (19^ for a 
discussion of divergences and surrogate losses from this viewpoint. 
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Recall that the key ingredient in our earlier analysis was the relation between the empirical 
processes defined by equation (I25ap and the "distance" d{gQ,g) (see Lemma [2]). Similarly, the key 
technical ingredient in the extension to general (p involves relating the quantity 

= sup l\cP*{f)-cj,*{fo))di¥n-F)-[if-fo)d{Qn-C 

f<=:F J J 

to the distance ^^(/o, /) defined in equation ([^7|) . In particular, we can state the following analog 
of Lemma [2] and Lemma O 



Lemma 6. Let fn be the estimate of fo obtained by solving the problem (I45p . Then 

dMoJn) < vtiT). (48) 

Under suitable technical conditions, we have Vn{J-) 0, so that Lemma [6] implies that fn 
is a consistent estimator for /o in the sense of d^. This lemma also provides the technical means 
to derive convergence rates in the same manner as in the previous sections. Note that d^{fQ,f) is 
usually not a proper metric. To apply standard results from empirical process theory, the trick is 
that one can replace by a lower bound which is a proper metric (such as L2 or Hellinger metric). 
In the case of KL divergence, we have seen that this lower bound is the Hellinger distance (via 
Lemma [2]^i)). 

Let us illustrate this idea by stating a result about likelihood ratio estimation in terms of the 
X- square divergence^ which is defined by 



D^(P,Q) := J p'o/qodiJ. (49) 

Note that this divergence is an /-divergence with (/>(ti) = 1/n; a short calculation (see Appendix IF]) 
shows that the associated "distance" is given by d^{g,go) = J{g — go)'^dQ, which is simply the 
L2(Q) metric. With this set-up, the estimator now has the following "least square" form: 



gn := argmaxggg j -g^ dQn J dF^ 



The following theorem is an analog of Theorem [2] (with an almost identical proof) : 
Theorem 4. Assume that for some constant < 7 < 2, 

Wf(g,L2(Q)) < Ag5-\ (50) 

and moreover that condition (I27ap holds. Then the estimator 'gn obtained from the x-square diver- 
gence is consistent with rate d^{go,gn) = Op(n~^/('^+^)). 

Remark: Comparing with Theorem[2l we see that the conditions of Theorem|3]are weaker. Indeed, 
the L2{Q) metric is dominated by the Hellinger metric, so that imposing bounds on L2(Q)-metric 
and its induced entropy are milder conditions. 
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5.3 Estimation of the divergence functional D^p 

Suppose that we are primarily interested in estimating the divergence functional I?^, given that we 
have already obtained an optimal estimator Qn of the likelihood ratio function gQ = Po/ go (such as 
the one defined by (j9ap or (|12ap . or more generally (j45p ). We have demonstrated that can be 
estimated by ^ and (fT2bl) . or more generally by (|46p . Note that can be viewed as an integral 
of the likelihood ratio under the distribution Q. Indeed, we can write 

D4F,q) = j {po/qo)HQo/po) dq = j goHUgo) dQ. 

Although D(f, is an integral functional of go = Po/qo, an interesting feature here is that the integra- 
tion is with respect to unknown Q. In this section, we show that estimators such as (|9bp and (|12bp 
for the KL divergence can be viewed as a first-order Taylor expansion of the integral functional 
around the estimate gn of the likelihood ratio. This discussion is motivated by a line of work on 
the estimation of integral functional of a single density function (cf . [13^ [3] ) , and also leads to an 
open question. 

Suppose that (p ■ ^ K is a convex function differentiable up to third order, ^ is a smooth 
function class bounded from both above and below as in Example 1 (with smoothness parameter a). 
Suppose that g„ is an estimator of go (such as the one defined by (PS]) ). and that ||^„ — (7o||l2v 
Op(n~°/^^"''''^^). Using a Taylor expansion around gn, we obtain: 



g(t>{l/g) = gn(p{llgn) + {g- gn){H^/gn) - 0'(l/?„)/?„) + {g - gnf^"{l/gn)/gl + 
0{{g-gnf) 

= </.'(l/5n) + ^"{l/gn)/gn + g{4>{^/gn) - 0'(l/?n)/?n " 2^" {I /gn) /Z) + 
5V'(l/?n)/?'+0((9-?nf). 



We arrive at 



L'^(P,Q) = j gct>{l/g)dQ 

= j ((A'(l/5n) + <A"(l/5n)/?n) dQ 

+ j (0(l/5n) - 0'(l/?n)/?n " 2</."(l/^„)/^2 ) dF 

+ [ Pl/qo ^"{l/gn)/gl d^i + 0{\\go - gnWl). 



In the above expression, the first two integrals can be estimated from two n-samples of empirical 
data drawn from P and Q. Because of the boundedness assumption, these estimations have at most 
Op(n~^/^) error. The error of our Taylor approximation is O(||go "Snlll) = Op(n~'^"/(^°"^°')). This 
rate is less than ©(n""*^/^) for a > d/A. Thus when a > d/4, the optimal rate of convergence for esti- 
mating D^p hinges on the convergence rate for estimating the integral of the form J Po/qoip dfi. This 
is interesting because we have reduced the problem of estimating any /-divergence to a particular 
integral of two densities J Po/qo'4' dfJ-, where tp is a, known function. 

Let us return to the case of KL divergence, i.e., (/>(u) = — logn. If we use Taylor approximation 
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up to first order, the estimator has the following form: 

= J i^iyan) - <i>'{l/9n)/gn) dF„, + J </.'(!/?„) dQn 
= J log gn + IdFn - dndQn, 

which has exactly the same form as our original estimator (j9b|) . except that here gn can be any 
(optimal) estimator of the likelihood ratio. Note that we have shown the optimal convergence 
rate of n~^/^ for the KL divergence estimator, given a > d/2 (so that 7 = d/a < 2). Questions 
regarding the estimator and its analysis for the case a < d/2 remain unexplored. In particular, 
for the regime a G [d/4:,d/2], the optimal rate of n"^/^ for estimating KL divergence (and -D^ in 
general) is certainly achievable by using Taylor expansion up to second order, assuming that a 
separate method exists to achieve the optimal rate n~^/^ for the integral J p^/qoi^ dfx. 



6 Conclusions 

We have developed and analyzed M-estimation methods for both the likelihood ratio and /- 
divergence functionals of two unknown multivariate probability distributions by exploiting a vari- 
ational characterization of /-divergence functionals. The methods are shown to be amenable to 
efficient computational algorithms for optimization in high-dimensional function spaces. We have 
also described our method in the general context of estimating integral functionals of the likelihood 
ratio of two unknown densities, and discussed directions for future work suggested by our results. 
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A Proof of Lemma [2] 

(i) Note that for x > 0, ^ logx < ^ - 1. Thus, flog-^dF<2f {g^/'^g^^''^ - 1) dF. As a result, 

d(5o, g) > y (5 - 50) - 2 y {g^/^go^^^ - l) dF 
{g-go) dQ-2 j\g^'^gl'^ -go) dQ 
\g"' - gl"?dQ. 
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(ii) By our estimation procedure, we have J ^„(iQri — / logg„dP„ < f godQn — J ^og godPn- It follows 
that 

d{go,gn) = jidn- go)dQ - j (log gn - loggo)dF 

< [idn- 9o)d{Q - Qn) - / (log^„ - loggo)d{¥ - ¥n) 



< sup / log — (i(P„ - P) - f{g- go)d(Qn 
geg J 90 J 



B Proof of Lemma [3] 

The first inequality is straightforward. We shall focus on the second. By the definition of our 
estimator, we have: 



loggndPn < j godQn " J loggodFn- 

Both sides are convex functionals of g. Use the following fact: If F is a convex function and 
F{u) < F{v), then F{{u + v)/2) < F{v). We obtain: 

I h±9^dqn - J log ^^dP„ < I godQn - I log^odPn. 

Rearranging, 

I ^-I^diQn - Q) - 1 log - IP) ^ / ^^^P - / ^^^dQ 

= -digo,'-^)<-2h^^{go,'-^), 
where the last inequality is an application of Lemma [2l 

C Proof of Lemma [4] 

Define di{go,g) = Jig- go)dQ - log j^dF. Note that for x > 0, i log j; < - 1. Thus, 

/ log ^ dP < 2 / (5'/'5o - 1) d¥. 
J 90 J 

As a result, for any g, di is related to /jq as follows: 

di{go,9) > j{g-go)dQ-2j {g^'^g^^^^ - I) dF 

= j{g-go)dq-2j {g^'^gl'^ - 50) dQ = | {g^'^ - g^fdq 
= '^hi{go,g). 
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By the definition (jl2ap of our estimator, we have: 



dndQn 



Both sides are convex functionals of g. By Jensen's inequahty, if F is a convex function, then 
F{{u + v)/2) - F{v) < {F{u) - F{v))/2. We obtain: 



9n+90 



log ^^dFr. + ^I\9r.) < I 50dQ„ - I log50dP„ + 



Rearranging, / ^d(Q„ - Q) - / log ^(i(P„ - P) + < 

log '-^dF - I '-^dQ + ^l\9o) = -d,igo, '-^) + ^I'igo) 

< -2hl{go, ^4^) + ^iHgo) < -lhl{go,9n) + ^I'igo), 

where the last inequality is a standard result for the (generalized) Hellinger distance 
(cf. [26]). 

D Proof of Theorem [2] 

(a) One of the empirical processes on the right-hand side of Lemma [3] involves function class 
J- := log For each g G G, let fg := log We endow with a "norm," namely, Bernstein 

distance. This is defined as follows: for a constant K > 0, 



PkU? ■■= 2K^ j (el^l/^ - 1 - |/|/i^)dF. 



The Bernstein distance is related to the Hellinger distance in several crucial ways (see, e.g., [26], 
page 97): 

• Pi(/<,)<4/iq(<7o,^). 

• The bracketing entropy based on Bernstein distance is also related to the bracketing entropy 
based Hellinger distance (i.e., which is the L2 norm for the square root function): 

n%{T,pi)<nnQ.L2mi (51) 

where Q := {{[g + go)/2)i/2^ g G G} and g := {g + go)/2. 
By Lemma [3l for any 6 > 0, with respect to P measure: 
P{hQ{go,gn) >6)< P{hQ{go,{gn+go)/2) > 6/ A) 

< P( sup -fig- go)d{Qn -Q)+ [ fg d{Fn - P) - 2/1^(50, 5) > 

< p( sup - [{g-go)d{Qn-Q)-h^Q{go,9)>0 

\g&S,hQ{go,g)>5/4 J 

+ p( sup [ fgd{¥n-r)-hligo,g)>o) ■.= A + B. 

\g&S,hQ{go,g)>5/4J J 
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We need to upper bound the two quantities A and B on the right-hand side of this equation. These 
can be handled in a similar manner. Since 7if{Q,L2{Q)) < oo the diameter of G is finite. Let 
S be the minimum s such that 2^~^^6/4: exceeds that diameter. We apply the so-called peeling 
device: Decompose Q into layers of Hellinger balls around go and then apply the union bound on 
the probability of excess. For each layer, one can now apply the modulus of continuity of suprema 
of an empirical process. 

B < Vpf sup [ fgd{¥n-f)>2^'{6/Af 

Note that if hQ{go,g) < 2'*+^(5/4 then pi{fg) < 2'*+^5. Note that for any s = 1, . . . , 5, the bracketing 
entropy integral can be bounded as: 

r \f{J^n{hQ{go,g) < 2^+^5/4}, Pi)'/' de 
Jo 

< / C9(e/^)-^e/2 
Jo 

where Cs^Cg are constants independent of s. Now apply Theorem [6] (see Appendix iGll. where 
K = 1, R = 2^+1(5, a = Ciy/7iR^/K = Ci^2'^^'+^^'^ . We need 

a > CoC8(2"+i5)i-^e/2 > CqR. 

This is satisfied if 6 = n'^^^^s^"^^ and Ci = CqCs, where Cg is sufficiently large (independently of 
s). Finally, > C^{Ci + 1) = C^iCoCs + 1) if Cq := 2C^C8 V 2C, where C is some universal 
constant in Theorem [H Applying this theorem, we obtain: 

s 

B < ^Cexp 

s=0 

for some universal constant c. A similar bound can be obtained for A, with respect to Q measure and 
with 5 = n~^/('^6+2). Since po/qo is bounded from above, this also implies a probability statement 
with respect to P. Thus, hQ{gQ,gn) is bounded in P-probability by n~^/('^5+2). 
(b) The proof is similar to Theorem [3]I^b) and is omitted. 



C^Ci + 1) 



< cexp 



n5^ 

T2~ 



E Comparison of the rate in Lemma [2] to the minimax rate 

Recall that the minimax rate is defined as 

rn ■■= Jnf supEp [/iQ(5to,?n)], 

where the supremum is taken over all pairs (P, Q) such that go £ Q. Note that r„ > inf^^gg supp E/i^((7o 
where we have fixed Q = p, the Lebesgue measure on X. We can reduce this bound to the minimax 
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lower bound for a nonparametric density estimation problem [32]. This reduction is not entirely 
straightforward, however, because the space Q ranges over smooth functions that need not be valid 
probability densities. Therefore, an easy-to-use minimax lower bound such as that of ^31j is not 
immediately applicable. Nonetheless, we can still apply the hypercube argument and the Assouad 
lemma to obtain the right minimax rate. See §24.3 of van der Vaart [27] for a proof for the case of 
one dimension. This proof goes through for general d>l. 



F Some calculations for Theorem [4] 



Note that the conjugate dual of (l){u) = 1/u takes the form 

—2J—V if w < 0, and 



-|-oo 



otherwise. 



Consequently, we can restrict T to the subset for which / < for any f ^ T . Let g := \/—f and 
G = \/ —T . ^ is a function class of positive functions. We have go := V~fo = \/—(p'{Qo/Po) = Po/qo- 

Define gn '■= \J —fn- We also replace notation (i^(/o,/) by d^{go,g). For our choice of (f>, we have: 
d^{go,g) = d^{h,f) = j{-2y^ + 2y^o)dF-{f-fo)dQ 

(go - 5)(2po/go -50 - 

(g - gofdQ, 



as claimed. Moreover, we have 



Vni^) = sup 

9&g 



2ig' 



J{g- go)d{¥n - P) 



G Results from empirical process theory 



For completeness, we state here two standard results from empirical process theory that are needed 
in the paper. These results are versions of Theorems 3.7 and 5.11 from van de Geer [26|, respectively: 

Theorem 5. Let G{x) = sup^gg \g{x)\ he the envelope function for a function Q. Assume that 
J GdF < oo, and suppose moreover that for any 6 > 0, -Hs{G, -Li(Pn)) — > 0. Then sup^gg / gd(Fn- 
P) 0. 

Theorem 6. Suppose that the function class Q satisfies supg^g pxig) < R for some constants K 
and R. Given a > 0, suppose that for some constants G and Gi, there holds 

a < Gi^R^/K 



a > y^G^{Gi + i) Q n^{g,pKf'^duy R 



Then the empirical process is hounded as 



sup 



P)| > a 



< G exp 



C2(Ci + l)i?2 



(52) 
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